Part 1: Background

In this second vignette, we will be exploring real-world spatial data and an application of spatial statistics from a social science perspective. We will build toward fitting the Gaussian models we’ve studied to account for spatial autocorrelation (SAR, CAR, Lag SAR), and along the way will discuss modeling assumptions, contrast modeling decisions, and eventually compare potential models.

Our research question(s) will differ slightly based on your suggestions, but this vignette will touch on the geography of economic inequality and the legacy of discrimination, all within the Boston area. Specifically, we will model home values as a function of several socioeconomic characteristics at the neighborhood level. Note: We are purposefully not including physical characteristics of homes themselves - only neighborhood and individual characteristics will be used, aggregated at the neighborhood (Census Tract) level. The one exception we consider is the median year in which homes were built for each neighborhood, which we include as an illustrative example of handling missing data.

The file for this vignette combines data from two sources. The primary source is the US Census Bureau’s American Community Survey (ACS) 2021 estimates, which account for all but one of our covariates. These data are open-source and can be accessed at data.census.gov. The other source is the University of Richmond’s Digital Scholarship Lab, which has compiled and geo-referenced scans (that is, painstakingly converted scanned images into shapefiles) of historic redlining maps from dozens of mid-20th-century American cities.

Redlining was a discriminatory process by which the United States government produced a number of maps of urban areas across the country, for use as lending guidelines for banks considering real estate investments. Beginning in 1935 and ending only in 1968 with the passage of the Fair Housing Act (Title VIII of the Civil Rights Act of 1968), the practice takes its name from the literal red lines drawn around neighborhoods deemed “most hazardous” or “undesirable”, which were disproportionately minority neighborhoods.

With this designation, prospective homeowners from redlined neighborhoods were routinely denied loans and unable to build equity. Decades of discriminatory lending have entrenched a racial gap in homeownership and wealth between white and non-white Americans, on average, a gap that can be easily observed in the open-source Census data we will be working with today.

Importantly, redlining was not the beginning of housing discrimination in the United States - racially restrictive deeds were regularly employed without the direction of redlining maps, for instance - but because the maps are so plentiful and so well preserved, redlining is one of the most visible forms from a historical perspective. We will include in our spatial linear models an indicator giving the proportion of each neighborhood, by area, that received this “most hazardous” designation. When controlling for other important covariates (as we will), redlining may be significant but not the driving predictor of home values. But the fact that a simple indicator from a nearly 100-year-old map is related to so many salient socioeconomic characteristics of modern life speaks to the complicated, interconnected nature of urban geography.

Boston’s redlining map. Source: University of Richmond.

Part 2: Setup

load("02-vignette.RData")

The sf dataframe which we will use to predict home values (boston_fulldata_sf) contains the features summarized below, all measured at the geographic level of Census Tracts (our proxy for neighborhoods).

We should briefly explore the features of boston_fulldata_sf, noting their scales and ranges, missing values, their spatial patterns (through mapping), and their relationships with the target variable (through plotting).

summary(boston_fulldata_sf)
##     GEOID           median_homeval    median_hh_income mean_hh_income  
##  Length:437         Min.   : 225700   Min.   : 14590   Min.   : 35611  
##  Class :character   1st Qu.: 498000   1st Qu.: 73580   1st Qu.: 97531  
##  Mode  :character   Median : 621800   Median :100114   Median :125676  
##                     Mean   : 698540   Mean   :106068   Mean   :140267  
##                     3rd Qu.: 826500   3rd Qu.:128250   3rd Qu.:167205  
##                     Max.   :2000001   Max.   :250001   Max.   :360215  
##                                       NA's   :1                        
##  percapita_income pct_25plus_bachelors median_yearbuilt   gini_index    
##  Min.   :  7249   Min.   : 9.10        Min.   :1940     Min.   :0.2436  
##  1st Qu.: 38812   1st Qu.:39.00        1st Qu.:1950     1st Qu.:0.4094  
##  Median : 53806   Median :61.00        Median :1958     Median :0.4463  
##  Mean   : 58622   Mean   :57.74        Mean   :1962     Mean   :0.4550  
##  3rd Qu.: 72669   3rd Qu.:78.90        3rd Qu.:1970     3rd Qu.:0.4991  
##  Max.   :171967   Max.   :97.10        Max.   :2014     Max.   :0.7135  
##                                        NA's   :212                      
##  median_rent_inc_ratio pop_total_2021   pct_black        pct_lathisp     
##  Min.   :11.50         Min.   : 906   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:25.30         1st Qu.:2913   1st Qu.:0.02053   1st Qu.:0.04487  
##  Median :28.60         Median :3897   Median :0.04903   Median :0.08602  
##  Mean   :29.55         Mean   :4047   Mean   :0.11405   Mean   :0.14453  
##  3rd Qu.:32.00         3rd Qu.:5269   3rd Qu.:0.12222   3rd Qu.:0.18080  
##  Max.   :51.00         Max.   :8838   Max.   :0.80448   Max.   :0.86012  
##  NA's   :4                                                               
##    pct_white          pct_asian        pct_nonwhite     homeownership_rate
##  Min.   :0.004838   Min.   :0.00000   Min.   :0.03363   Min.   :0.01811   
##  1st Qu.:0.462998   1st Qu.:0.04217   1st Qu.:0.25922   1st Qu.:0.30062   
##  Median :0.634104   Median :0.10177   Median :0.36590   Median :0.45041   
##  Mean   :0.572691   Mean   :0.12139   Mean   :0.42731   Mean   :0.46393   
##  3rd Qu.:0.740783   3rd Qu.:0.17350   3rd Qu.:0.53700   3rd Qu.:0.62866   
##  Max.   :0.966368   Max.   :0.69057   Max.   :0.99516   Max.   :0.96933   
##                                                                           
##  homeownership_rate_black pct_pubtransit   avg_traveltime        D         
##  Min.   :0.0000           Min.   :0.0000   Min.   :15.28   Min.   :0.0000  
##  1st Qu.:0.0000           1st Qu.:0.1246   1st Qu.:27.41   1st Qu.:0.0000  
##  Median :0.2099           Median :0.2069   Median :30.75   Median :0.0000  
##  Mean   :0.2926           Mean   :0.2182   Mean   :30.82   Mean   :0.1962  
##  3rd Qu.:0.4823           3rd Qu.:0.2929   3rd Qu.:34.19   3rd Qu.:0.2671  
##  Max.   :1.0000           Max.   :0.6512   Max.   :45.61   Max.   :1.0000  
##  NA's   :44                                                                
##        C                  B                 A                    geometry  
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   MULTIPOLYGON :437  
##  1st Qu.:0.006897   1st Qu.:0.00000   1st Qu.:0.00000   epsg:4269    :  0  
##  Median :0.291147   Median :0.00000   Median :0.00000   +proj=long...:  0  
##  Mean   :0.399254   Mean   :0.10369   Mean   :0.02413                      
##  3rd Qu.:0.792721   3rd Qu.:0.09362   3rd Qu.:0.00000                      
##  Max.   :1.000000   Max.   :0.98501   Max.   :0.92325                      
## 
# this command will loop over each column and compute the number of NAs
apply(boston_fulldata_sf, 2, function(x){sum(is.na(x))})
##                    GEOID           median_homeval         median_hh_income 
##                        0                        0                        1 
##           mean_hh_income         percapita_income     pct_25plus_bachelors 
##                        0                        0                        0 
##         median_yearbuilt               gini_index    median_rent_inc_ratio 
##                      212                        0                        4 
##           pop_total_2021                pct_black              pct_lathisp 
##                        0                        0                        0 
##                pct_white                pct_asian             pct_nonwhite 
##                        0                        0                        0 
##       homeownership_rate homeownership_rate_black           pct_pubtransit 
##                        0                       44                        0 
##           avg_traveltime                        D                        C 
##                        0                        0                        0 
##                        B                        A                 geometry 
##                        0                        0                        0

We can see that most of our covariates have pretty good coverage, with only a few missing values. Census data are often nice in this way. However, the values that are missing could cause problems when we attempt to use these covariates in spatial linear models, as we will see later. Now let’s inspect the spatial layout of a few interesting variables with mapview.

# Map the target variable
mapview(boston_fulldata_sf, zcol = "median_homeval")

The target variable has some clear spatial dependence. The distribution of median home values looks fairly continuous, with neighboring areas often having similar values. We should also look at some of the covariates that we might be interested in using in our models. The following code maps median household income and can be easily modified to explore other covariates by supplying the variable name to the zcol argument.

# Map an interesting covariate
mapview(boston_fulldata_sf, zcol = "median_hh_income")

Let’s also do some simple scatterplots of these relationships. First we define the helper function plot_covariates_simply, which can be easily modified below for any other combination of variables.

plot_covariates_simply <- function(var_x = median_hh_income,
                                   var_y = median_homeval) {
  ggplot(data = boston_fulldata_sf,
         aes(x = {{var_x}}, y = {{var_y}})) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "loess", se = F)
  }

plot_covariates_simply(var_x = median_hh_income,
                       var_y = median_homeval)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

The Plan

From what is commonly known of neighborhood home values, we can reasonably suspect that there may be spatial autocorrelation between neighborhoods or a spatial trend. We will fit baseline models, then test the residuals of those models for spatial dependence using the package sfdep. If spatial dependence is present, we will make particular modeling decisions to represent the neighbor and weighting scheme, and then fit several spatial linear models (SAR, CAR, Lag SAR). Next, we will incorporate the covariate indicating redlining status and perform a likelihood ratio test to judge its usefulness. Finally, we will compare the models based on AIC and log-likelihood.

Along the way we will have to deal with two main roadblocks: missing data and unconnected polygons. We will handle and discuss these modeling decisions and others.

Part 3: Assessing spatial dependence

We start by determining neighbor and weighting schemes as we have learned in class, using functions from sfdep:

# Define contiguity
boston_contig <- st_contiguity(boston_fulldata_sf)
# Define a binary weighting scheme (using style = "B")
boston_weights.B <- st_weights(boston_contig, style = "B")
## Error in nb2listw(nb, style = style, zero.policy = allow_zero, ...): Empty neighbour sets found

Why are we getting this error? Let’s inspect the contiguity we defined:

boston_contig
## Neighbour list object:
## Number of regions: 437 
## Number of nonzero links: 2364 
## Percentage nonzero weights: 1.237897 
## Average number of links: 5.409611 
## 1 region with no links:
## 396

We can see from the printout that we have 1 region with no links. Since we’ve defined contiguity so that regions are linked only if they touch, it sounds like we have an island. Let’s inspect visually. We’ll plot the offending Census tract alone and then also plot all Census tracts overlaid, so that we can see what’s going on. (mapview tip: Zoom to the extent of the single tract by clicking its name in the bottom-right corner, then check and uncheck the overlaid layer to compare.)

mapview(boston_fulldata_sf) + mapview(boston_fulldata_sf[396, ])

Surprisingly, it’s not a real island! This is an artifact of a modeling decision, one made for you in the data preprocessing step, in which I dropped Census tracts with no data for the target variable. While this wasn’t an issue in most cases, here we dropped tracts in such a way as to create an island.

How would you deal with the “island”? Compare and contrast the following approaches:

1. Delete the island tract from the dataset and re-specify the contiguity.
2. Use a different weighting scheme besides binary, perhaps based on distance rather than intersection.
3. Keep the island tract but assign it as not being a neighbor of any other tract.
4. Impute values for the missing observations so that no dropping is necessary.

Would your answers change if the island were very highly populated? What about if there were many more islands?

You Choose

Here’s your first modeling choice. Pick a method for dealing with this missing data problem and define a contiguity and weighting scheme. Remember what you’ve chosen, as we’ll do a class-wide analysis at the end! Run option 1 at least (we’ll need it for comparison in a minute), and then select your favorite weighting scheme.

# 1) Delete the island tract from your dataset and re-specify the contiguity
boston_fulldata_sf <- boston_fulldata_sf %>% 
  filter(GEOID != "25025010408") # leave this uncommented, even if you choose another option
boston_contig <- st_contiguity(boston_fulldata_sf) # this too

# now choose exactly one (each line overwrites the previous assignment):
boston_weights <- st_weights(boston_contig, style = "B")
boston_weights <- st_weights(boston_contig, style = "W")
boston_weights <- st_weights(boston_contig, style = "C")
boston_weights <- st_weights(boston_contig, style = "U")
boston_weights <- st_weights(boston_contig, style = "S")

# 2) Use a different weighting scheme besides binary, perhaps based on distance rather than intersection
# Here we use an inverse-distance weighting scheme, which doesn't mind islands

boston_dist.range <- st_dist_band(st_centroid(boston_fulldata_sf),
                                  lower = 0, # specify lower bound of range (m)
                                  upper = 2e4) # specify upper bound of range (m)
boston_weights <- st_inverse_distance(nb = boston_dist.range,
                                      geometry = st_centroid(boston_fulldata_sf),
                                      scale = 10^3) # rescale units to km

# 3) Keep the island tract but assign it as not being a neighbor of any other tract.
# Not implemented, but one could use the `allow_zero = TRUE` argument.

# 4) Impute values for the missing observations so that no dropping is necessary
# Not implemented. Dropping them was done as a preprocessing step, so undoing it is left as a thought experiment.
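For completeness, option 3 could be sketched as follows. This is a sketch, not run here, and it assumes you skipped the filtering in option 1 so that the island tract is still present:

```r
# 3) sketch: keep the island but let it have an empty neighbour set.
# allow_zero = TRUE tells st_weights to tolerate regions with no links,
# effectively giving the island zero weight in all spatial terms
boston_weights <- st_weights(boston_contig, style = "B", allow_zero = TRUE)
```

Note that downstream functions (e.g., the Moran’s I test) may also need to be told to tolerate the empty neighbour set.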

Now that we’ve dealt with the contiguity issue, we can perform our first Moran’s I test. (We can see that our modeling choices are starting to propagate through the analysis - it’s a good idea to think carefully early on about how the impact of your choices will scale!)

global_moran_test(boston_fulldata_sf$median_homeval,
                  nb = boston_contig,
                  wt = boston_weights)
## 
##  Moran I test under randomisation
## 
## data:  x  
## weights: listw    
## 
## Moran I statistic standard deviate = 20.097, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5981381856     -0.0022988506      0.0008926692

The preliminary Moran’s I test confirms what we already knew: there is strong evidence of some spatial trend or correlation in median home values of neighborhoods. Another way to confirm this would be with a spatial lag plot:

boston_fulldata_sf %>% 
  mutate(lag = st_lag(median_homeval,
                      nb = boston_contig,
                      wt = boston_weights)) %>% 
  ggplot(data = .,
         aes(x = median_hh_income,
             y = lag)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = F) +
  labs(x = "Median household income",
       y = "Spatial Lag",
       title = "Spatial lag, median home value\nvs. median household income")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

The shape of the trend in the spatial lag plot can look fairly different depending on choice of weighting.

Part 4: Baseline modeling with OLS, ignoring spatial structure at first

Now it’s time to start putting together some linear models. Remember the plan: start with non-spatial models, then test for spatial dependence of residuals. You’ll have some flexibility here too!

First, we set up some very basic baseline models:

# raw baseline, mean of target variable
lm0 <- lm(data = boston_fulldata_sf,
          formula = median_homeval ~ 1) # if no predictors, we just get the mean
summary(lm0)
## 
## Call:
## lm(formula = median_homeval ~ 1, data = boston_fulldata_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473064 -201039  -76064  127836 1301237 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   698764      13525   51.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 282400 on 435 degrees of freedom
# naive baseline, predicting log of the target variable with log-mean income
# Note: While we use log-mean income here because it is nicer mathematically,
# median income is more interpretable and might be worth using if you need
# to communicate results. Often results will agree more or less
lm1 <- lm(data = boston_fulldata_sf,
          formula = log(median_homeval) ~ log(mean_hh_income))
summary(lm1)
## 
## Call:
## lm(formula = log(median_homeval) ~ log(mean_hh_income), data = boston_fulldata_sf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94033 -0.16409 -0.02766  0.12581  0.87600 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.94289    0.31783   18.70   <2e-16 ***
## log(mean_hh_income)  0.63306    0.02701   23.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2409 on 434 degrees of freedom
## Multiple R-squared:  0.5587, Adjusted R-squared:  0.5577 
## F-statistic: 549.4 on 1 and 434 DF,  p-value: < 2.2e-16

Then, more full models, both with and without the redlining covariate, called D:

lm2 <- lm(data = boston_fulldata_sf,
          formula = log(median_homeval) ~ log(mean_hh_income) + pct_25plus_bachelors +
            pct_nonwhite + avg_traveltime + homeownership_rate)
summary(lm2)
## 
## Call:
## lm(formula = log(median_homeval) ~ log(mean_hh_income) + pct_25plus_bachelors + 
##     pct_nonwhite + avg_traveltime + homeownership_rate, data = boston_fulldata_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9198 -0.1236 -0.0115  0.1238  0.6683 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.0498915  0.5496405  11.007  < 2e-16 ***
## log(mean_hh_income)   0.6419890  0.0501120  12.811  < 2e-16 ***
## pct_25plus_bachelors  0.0034564  0.0008582   4.027 6.67e-05 ***
## pct_nonwhite          0.2698624  0.0669328   4.032 6.55e-05 ***
## avg_traveltime       -0.0126894  0.0021161  -5.997 4.27e-09 ***
## homeownership_rate   -0.2916525  0.0711049  -4.102 4.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.206 on 430 degrees of freedom
## Multiple R-squared:  0.6803, Adjusted R-squared:  0.6766 
## F-statistic:   183 on 5 and 430 DF,  p-value: < 2.2e-16
lm3 <- lm(data = boston_fulldata_sf,
          formula = log(median_homeval) ~ log(mean_hh_income) + pct_25plus_bachelors +
            pct_nonwhite + avg_traveltime + homeownership_rate + D)
summary(lm3)
## 
## Call:
## lm(formula = log(median_homeval) ~ log(mean_hh_income) + pct_25plus_bachelors + 
##     pct_nonwhite + avg_traveltime + homeownership_rate + D, data = boston_fulldata_sf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94675 -0.12273 -0.00625  0.12174  0.68824 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.066042   0.548134  11.067  < 2e-16 ***
## log(mean_hh_income)   0.635392   0.050093  12.684  < 2e-16 ***
## pct_25plus_bachelors  0.003691   0.000865   4.267 2.44e-05 ***
## pct_nonwhite          0.264194   0.066810   3.954 8.97e-05 ***
## avg_traveltime       -0.011855   0.002157  -5.496 6.68e-08 ***
## homeownership_rate   -0.264471   0.072385  -3.654  0.00029 ***
## D                     0.060818   0.032622   1.864  0.06296 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2054 on 429 degrees of freedom
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6785 
## F-statistic:   154 on 6 and 429 DF,  p-value: < 2.2e-16

You Choose

Now define your own linear model based on predictors that you think might be interesting. If you need a reminder, try names(boston_fulldata_sf) or summary(boston_fulldata_sf). However, don’t include the variable D just yet. We’ll do that last.

As a bonus, you may even be interested in doing some feature engineering here. It may be the case that whether or not a neighborhood is a suburb is predictive of home value (at least, it’s plausible). In the absence of a variable measuring this directly, one could compute a proxy fairly simply, append it to the design matrix, and specify the new feature in your models. (This does raise even more questions and require more modeling decisions: is it a binary variable defined at a threshold? Or a continuous variable, perhaps squared in the formula?)

### OPTIONAL
# Location of the center of downtown, given as an sf point
downtown_location <- data.frame(lon = c(-71.05766), lat = c(42.35987)) %>% 
  st_as_sf(coords = c(x = "lon", y = "lat")) %>% 
  st_set_crs(value = st_crs(boston_fulldata_sf))

# If you were to attempt this, you might try st_centroid() followed by st_distance()
# you could save that result as a vector, and then append it to boston_fulldata_sf

# distance_from_downtown <- ...
# boston_fulldata_sf$dist <- distance_from_downtown
### OPTIONAL
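Following the hint above, one possible sketch; the column name dist_downtown_km is our own choice, and the kilometer conversion assumes st_distance returns meters for this CRS:

```r
# Sketch: straight-line distance from each tract centroid to downtown,
# converted to kilometers and appended as a new feature
tract_centroids <- st_centroid(st_geometry(boston_fulldata_sf))
distance_from_downtown <- as.numeric(st_distance(tract_centroids,
                                                 downtown_location)) / 1000
boston_fulldata_sf$dist_downtown_km <- distance_from_downtown
```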

Additionally, you may be interested in using the “median year homes were built” variable, but might have noticed that it is missing many observations. How should we proceed? As in the first missing data problem, we have a number of options, although this time simply subsetting to neighborhoods where data exist would halve our sample size and likely obscure any potential spatial structure. We could try the following naive imputation, replacing missing values with the sample mean for that covariate, which ignores the influence of neighbors and therefore yields the same result regardless of spatial layout. Is there another way?

boston_fulldata_sf <- boston_fulldata_sf %>% 
  mutate(median_yearbuilt = replace_na(median_yearbuilt,
                                       mean(median_yearbuilt, na.rm = TRUE)))
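One spatially informed alternative, sketched below, would be to borrow each missing tract’s value from the mean of its neighbors. (It would need to run instead of the mean imputation above, while the NAs are still present.)

```r
# Sketch: impute missing values with the mean of each tract's neighbours,
# using the contiguity list defined earlier
yb <- boston_fulldata_sf$median_yearbuilt
neighbour_means <- sapply(seq_along(yb),
                          function(i) mean(yb[boston_contig[[i]]], na.rm = TRUE))
yb[is.na(yb)] <- neighbour_means[is.na(yb)]
# tracts whose neighbours are also all missing remain NaN;
# fall back to the overall sample mean for those
yb[is.nan(yb)] <- mean(boston_fulldata_sf$median_yearbuilt, na.rm = TRUE)
boston_fulldata_sf$median_yearbuilt <- yb
```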

Back to the main thing, specifying a formula to be used in the non-spatial and spatial models:

# Formula example
formula_ex <- as.formula(log(median_homeval) ~ sqrt(mean_hh_income) + log(pop_total_2021))
# Your formula
myformula <- as.formula(___)
## Error: <text>:4:25: unexpected input
## 3: # Your formula
## 4: myformula <- as.formula(_
##                            ^

Fit a non-spatial linear model with your specified formula:

lm_myformula <- lm(data = boston_fulldata_sf,
                   formula = myformula)
summary(lm_myformula)
## 
## Call:
## lm(formula = myformula, data = boston_fulldata_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9198 -0.1236 -0.0115  0.1238  0.6683 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.0498915  0.5496405  11.007  < 2e-16 ***
## log(mean_hh_income)   0.6419890  0.0501120  12.811  < 2e-16 ***
## pct_25plus_bachelors  0.0034564  0.0008582   4.027 6.67e-05 ***
## pct_nonwhite          0.2698624  0.0669328   4.032 6.55e-05 ***
## avg_traveltime       -0.0126894  0.0021161  -5.997 4.27e-09 ***
## homeownership_rate   -0.2916525  0.0711049  -4.102 4.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.206 on 430 degrees of freedom
## Multiple R-squared:  0.6803, Adjusted R-squared:  0.6766 
## F-statistic:   183 on 5 and 430 DF,  p-value: < 2.2e-16
lm_plus_redlining <- lm(data = boston_fulldata_sf,
                        formula = update(myformula, ~ . + D)) # add redlining, "D"
summary(lm_plus_redlining)
## 
## Call:
## lm(formula = update(myformula, ~. + D), data = boston_fulldata_sf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94675 -0.12273 -0.00625  0.12174  0.68824 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.066042   0.548134  11.067  < 2e-16 ***
## log(mean_hh_income)   0.635392   0.050093  12.684  < 2e-16 ***
## pct_25plus_bachelors  0.003691   0.000865   4.267 2.44e-05 ***
## pct_nonwhite          0.264194   0.066810   3.954 8.97e-05 ***
## avg_traveltime       -0.011855   0.002157  -5.496 6.68e-08 ***
## homeownership_rate   -0.264471   0.072385  -3.654  0.00029 ***
## D                     0.060818   0.032622   1.864  0.06296 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2054 on 429 degrees of freedom
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6785 
## F-statistic:   154 on 6 and 429 DF,  p-value: < 2.2e-16
(LRT_lm <- anova(lm_myformula, lm_plus_redlining, test = "LRT"))

Now that we’ve fit several models, we can check their residuals for spatial dependence:

global_moran_test(residuals(lm1),
                  nb = boston_contig,
                  wt = boston_weights, alternative = "two.sided")
## 
##  Moran I test under randomisation
## 
## data:  x  
## weights: listw    
## 
## Moran I statistic standard deviate = 13.373, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3981558560     -0.0022988506      0.0008967494
cbind(residuals(lm2),
      st_lag(residuals(lm2), nb = boston_contig, wt = boston_weights)) %>% 
  as.data.frame() %>% 
  ggplot(data = ., aes(x = V1, y = V2)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = F) +
  labs(x = "Residuals",
       y = "Spatial Lag",
       title = "Spatial lag plot of residuals of lm")
## `geom_smooth()` using formula 'y ~ x'

At this stage, is this code able to perform a Moran’s I test and create a spatial lag plot? The answer will depend on your choices of weighting scheme and linear model. If it’s not working, what error are you getting, and why do you think that is?

qqnorm(residuals(lm3))
qqline(residuals(lm3))
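As an aside, spdep also provides lm.morantest(), which adjusts the expectation and variance of Moran’s I for the fact that the residuals come from a fitted regression rather than being raw data. A sketch, reusing the same neighbour and weight objects:

```r
# Sketch: regression-adjusted Moran's I test on lm residuals
lm.morantest(lm2, listw = recreate_listw(nb = boston_contig,
                                         wt = boston_weights))
```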

Based on the Moran’s I, the spatial lag plot of residuals, and perhaps the QQ plot, what can be said about the residuals of these non-spatial models?

Part 5: Spatial models

Assuming that your linear models did not fully explain the spatial structure of the residuals, we could probably benefit from modeling spatial dependence directly. We will use your chosen formula to fit analogous SAR, CAR, and Lag SAR models. Then, we will add the redlining covariate to obtain an additional model of each type. Finally, we will compare all models with AIC and log-likelihood.

sar_myformula <- spautolm(data = boston_fulldata_sf,
                 formula = myformula, # match your chosen formula
                 family = "SAR",
                 listw = recreate_listw(nb = boston_contig,
                                        wt = boston_weights))
summary(sar_myformula)
## 
## Call: 
## spautolm(formula = myformula, data = boston_fulldata_sf, listw = recreate_listw(nb = boston_contig, 
##     wt = boston_weights), family = "SAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.987591 -0.122840 -0.010888  0.107730  0.647015 
## 
## Coefficients: 
##                         Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)           6.63211284  0.57147812 11.6052 < 2.2e-16
## log(mean_hh_income)   0.60153281  0.05228694 11.5045 < 2.2e-16
## pct_25plus_bachelors  0.00181632  0.00095428  1.9033 0.0569960
## pct_nonwhite          0.11865566  0.07788414  1.5235 0.1276363
## avg_traveltime       -0.01140553  0.00231421 -4.9285 8.288e-07
## homeownership_rate   -0.26713891  0.07853651 -3.4015 0.0006703
## 
## Lambda: 0.37136 LR test value: 21.928 p-value: 2.8308e-06 
## Numerical Hessian standard error of lambda: 0.076327 
## 
## Log likelihood: 84.19758 
## ML residual variance (sigma squared): 0.03864, (sigma: 0.19657)
## Number of observations: 436 
## Number of parameters estimated: 8 
## AIC: -152.4
sar_plus_redlining <- spautolm(data = boston_fulldata_sf,
                               formula = update(myformula, ~ . + D), # add redlining, "D"
                               family = "SAR",
                               listw = recreate_listw(nb = boston_contig,
                                                      wt = boston_weights))
summary(sar_plus_redlining)
## 
## Call: spautolm(formula = update(myformula, ~. + D), data = boston_fulldata_sf, 
##     listw = recreate_listw(nb = boston_contig, wt = boston_weights), 
##     family = "SAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.002432 -0.125253 -0.009043  0.112171  0.657878 
## 
## Coefficients: 
##                         Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)           6.61512644  0.56946970 11.6163 < 2.2e-16
## log(mean_hh_income)   0.59962543  0.05214232 11.4998 < 2.2e-16
## pct_25plus_bachelors  0.00203347  0.00095493  2.1295  0.033217
## pct_nonwhite          0.11836027  0.07742985  1.5286  0.126360
## avg_traveltime       -0.01109169  0.00231572 -4.7897  1.67e-06
## homeownership_rate   -0.25330856  0.07874237 -3.2169  0.001296
## D                     0.05563154  0.03669681  1.5160  0.129525
## 
## Lambda: 0.36187 LR test value: 20.686 p-value: 5.4099e-06 
## Numerical Hessian standard error of lambda: 0.076421 
## 
## Log likelihood: 85.33592 
## ML residual variance (sigma squared): 0.0385, (sigma: 0.19622)
## Number of observations: 436 
## Number of parameters estimated: 9 
## AIC: -152.67

How do the two models compare? Additionally, we can perform a likelihood ratio test:

(LRT_SAR <- LR.Sarlm(sar_myformula, sar_plus_redlining))
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = -2.2767, df = 1, p-value = 0.1313
## sample estimates:
##      Log likelihood of sar_myformula Log likelihood of sar_plus_redlining 
##                             84.19758                             85.33592

From the likelihood ratio test above, does adding the redlining covariate significantly improve the model? We next repeat the process for CAR and Lag SAR models.

car_myformula <- spautolm(data = boston_fulldata_sf,
                 formula = myformula, # match your chosen formula
                 family = "CAR",
                 listw = recreate_listw(nb = boston_contig,
                                        wt = boston_weights))
## Warning in spautolm(data = boston_fulldata_sf, formula = myformula, family =
## "CAR", : Non-symmetric spatial weights in CAR model
summary(car_myformula)
## 
## Call: 
## spautolm(formula = myformula, data = boston_fulldata_sf, listw = recreate_listw(nb = boston_contig, 
##     wt = boston_weights), family = "CAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.941266 -0.121004 -0.011185  0.117774  0.664160 
## 
## Coefficients: 
##                         Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)           5.82064778  0.54749411 10.6314 < 2.2e-16
## log(mean_hh_income)   0.65823994  0.04992362 13.1849 < 2.2e-16
## pct_25plus_bachelors  0.00356193  0.00085675  4.1575 3.218e-05
## pct_nonwhite          0.31341271  0.06693379  4.6824 2.835e-06
## avg_traveltime       -0.01251439  0.00211134 -5.9272 3.081e-09
## homeownership_rate   -0.27475143  0.07097490 -3.8711 0.0001083
## 
## Lambda: 0.029182 LR test value: 0.84578 p-value: 0.35775 
## Numerical Hessian standard error of lambda: 0.25205 
## 
## Log likelihood: 73.65651 
## ML residual variance (sigma squared): 0.041759, (sigma: 0.20435)
## Number of observations: 436 
## Number of parameters estimated: 8 
## AIC: -131.31
car_plus_redlining <- spautolm(data = boston_fulldata_sf,
                               formula = update(myformula, ~ . + D), # add redlining, "D"
                               family = "CAR",
                               listw = recreate_listw(nb = boston_contig,
                                                      wt = boston_weights))
## Warning in spautolm(data = boston_fulldata_sf, formula = update(myformula, :
## Non-symmetric spatial weights in CAR model
summary(car_plus_redlining)
## 
## Call: spautolm(formula = update(myformula, ~. + D), data = boston_fulldata_sf, 
##     listw = recreate_listw(nb = boston_contig, wt = boston_weights), 
##     family = "CAR")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.9629636 -0.1201301 -0.0036255  0.1195497  0.6819370 
## 
## Coefficients: 
##                         Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)           5.85455585  0.54522903 10.7378 < 2.2e-16
## log(mean_hh_income)   0.65102211  0.04983204 13.0643 < 2.2e-16
## pct_25plus_bachelors  0.00376000  0.00086195  4.3622 1.288e-05
## pct_nonwhite          0.30449908  0.06669510  4.5655 4.982e-06
## avg_traveltime       -0.01179408  0.00214748 -5.4921 3.973e-08
## homeownership_rate   -0.25221968  0.07210621 -3.4979  0.000469
## D                     0.05441685  0.03252307  1.6732  0.094292
## 
## Lambda: 0.026653 LR test value: 0.74693 p-value: 0.38745 
## Numerical Hessian standard error of lambda: 0.17204 
## 
## Log likelihood: 75.36618 
## ML residual variance (sigma squared): 0.041434, (sigma: 0.20355)
## Number of observations: 436 
## Number of parameters estimated: 9 
## AIC: -132.73
(LRT_CAR <- LR.Sarlm(car_myformula, car_plus_redlining))
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = -3.4194, df = 1, p-value = 0.06444
## sample estimates:
##      Log likelihood of car_myformula Log likelihood of car_plus_redlining 
##                             73.65651                             75.36618
lagsar_myformula <- lagsarlm(data = boston_fulldata_sf,
                             formula = myformula, # match your chosen formula
                             listw = recreate_listw(nb = boston_contig,
                                                    wt = boston_weights))

summary(lagsar_myformula)
## 
## Call:
## lagsarlm(formula = myformula, data = boston_fulldata_sf, listw = recreate_listw(nb = boston_contig, 
##     wt = boston_weights))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.923357 -0.117932 -0.007385  0.103696  0.647759 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                        Estimate Std. Error z value  Pr(>|z|)
## (Intercept)           1.9910526  0.6910115  2.8814 0.0039596
## log(mean_hh_income)   0.5471283  0.0476729 11.4767 < 2.2e-16
## pct_25plus_bachelors  0.0005571  0.0008189  0.6803 0.4963137
## pct_nonwhite          0.1750670  0.0613152  2.8552 0.0043010
## avg_traveltime       -0.0075664  0.0020159 -3.7533 0.0001745
## homeownership_rate   -0.2286917  0.0654473 -3.4943 0.0004753
## 
## Rho: 0.38787, LR test value: 57.416, p-value: 3.5305e-14
## Asymptotic standard error: 0.046426
##     z-value: 8.3547, p-value: < 2.22e-16
## Wald statistic: 69.8, p-value: < 2.22e-16
## 
## Log likelihood: 101.9415 for lag model
## ML residual variance (sigma squared): 0.035517, (sigma: 0.18846)
## Number of observations: 436 
## Number of parameters estimated: 8 
## AIC: -187.88, (AIC for lm: -132.47)
## LM test for residual autocorrelation
## test value: 7.4237, p-value: 0.006437
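One caveat before comparing coefficients across models: in the lag model, a covariate's coefficient is not its marginal effect, because a change in one tract propagates to its neighbors through the spatial lag. With row-standardised weights, the average total impact of covariate k is beta_k / (1 - rho); spatialreg's impacts() function computes the full direct/indirect decomposition, but a rough back-of-the-envelope version using the estimates printed above is:

```r
rho  <- 0.38787              # estimated spatial lag parameter
beta <- 0.5471283            # coefficient on log(mean_hh_income)
beta / (1 - rho)             # average total impact, roughly 0.89
```

So the total association with log income is roughly 60% larger than the coefficient alone suggests, once spillovers are accounted for.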
lagsar_plus_redlining <- lagsarlm(data = boston_fulldata_sf,
                                  formula = update(myformula, ~ . + D), # add redlining, "D"
                                  listw = recreate_listw(nb = boston_contig,
                                                         wt = boston_weights))

summary(lagsar_plus_redlining)
## 
## Call:lagsarlm(formula = update(myformula, ~. + D), data = boston_fulldata_sf, 
##     listw = recreate_listw(nb = boston_contig, wt = boston_weights))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.9421880 -0.1205148 -0.0079426  0.1026895  0.6619661 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                         Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)           2.05202401  0.69195024  2.9656 0.0030213
## log(mean_hh_income)   0.54366144  0.04768847 11.4003 < 2.2e-16
## pct_25plus_bachelors  0.00075714  0.00082679  0.9158 0.3597946
## pct_nonwhite          0.17225077  0.06125595  2.8120 0.0049237
## avg_traveltime       -0.00704365  0.00205512 -3.4274 0.0006095
## homeownership_rate   -0.21039686  0.06666534 -3.1560 0.0015994
## D                     0.04265713  0.02991721  1.4258 0.1539148
## 
## Rho: 0.38313, LR test value: 55.922, p-value: 7.5384e-14
## Asymptotic standard error: 0.046592
##     z-value: 8.2231, p-value: 2.2204e-16
## Wald statistic: 67.619, p-value: 2.2204e-16
## 
## Log likelihood: 102.9535 for lag model
## ML residual variance (sigma squared): 0.035382, (sigma: 0.1881)
## Number of observations: 436 
## Number of parameters estimated: 9 
## AIC: -187.91, (AIC for lm: -133.99)
## LM test for residual autocorrelation
## test value: 7.4339, p-value: 0.0064008
(LRT_LagSAR <- LR.Sarlm(lagsar_myformula, lagsar_plus_redlining))
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = -2.024, df = 1, p-value = 0.1548
## sample estimates:
##      Log likelihood of lagsar_myformula Log likelihood of lagsar_plus_redlining 
##                                101.9415                                102.9535

How have things turned out? Let’s report all our metrics together:

Model       AIC         logLik
OLS         -132.4672    73.23361
OLS+D       -133.9854    74.99272
SAR         -152.3952    84.19758
SAR+D       -152.6718    85.33592
CAR         -131.3130    73.65651
CAR+D       -132.7324    75.36618
Lag SAR     -187.8830   101.94151
Lag SAR+D   -187.9070   102.95352
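Note that the AIC column is fully determined by each model's log-likelihood and parameter count via AIC = 2k - 2 * logLik, where k counts the regression coefficients plus the spatial parameter and the error variance. A quick check against two of the rows above:

```r
aic <- function(loglik, k) 2 * k - 2 * loglik

aic(101.94151, k = 8)   # Lag SAR: -187.883, matching the table
aic(85.33592,  k = 9)   # SAR+D:   -152.672, matching the table
```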

Part 6: Limitations of modeling choices

If you’ve been able to tinker with the specifications of these models, then your results may differ substantially from your classmates’. Is that surprising or expected? Did you get any particularly unusual results?

We’ve made a lot of design choices in this process, and while each has been justified, many other similarly defensible choices could have led us somewhere else. Let’s talk about those choices and the limitations of these models.

Modeling decisions and limitations:

What other modeling decisions did we make that you recall? How much uncertainty do you think they introduce?
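One concrete example of such a decision is the weighting style. A toy example in base R shows why the CAR fits above produced a warning: binary contiguity weights are symmetric, but row standardisation (the common "W" style) generally breaks that symmetry, which the CAR model formally requires.

```r
# Toy adjacency for three tracts: tract 1 borders tracts 2 and 3,
# which do not border each other
B <- rbind(c(0, 1, 1),
           c(1, 0, 0),
           c(1, 0, 0))
W <- B / rowSums(B)   # row-standardised weights
isSymmetric(B)        # TRUE:  binary contiguity is symmetric
isSymmetric(W)        # FALSE: W[1, 2] = 0.5 but W[2, 1] = 1
```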

Part 7: Final thoughts

It is very likely that we still have omitted variables, so we would be wise to keep omitted variable bias in mind when interpreting any of these results. That said, we know a few things. First, outcomes at the neighborhood level are spatially linked, and that autocorrelation is non-trivial to remove. Cities are complicated! Second, formerly redlined neighborhoods are associated with somewhat lower economic outcomes than other neighborhoods, and we know some of the historical context for why that is likely the case. It is striking that nearly 100 years can pass without changing the fortunes of some neighborhoods. However, the maps are also not a shortcut for quantifying and locating discrimination - if they were, we would expect them to be more predictive of home prices and our other covariates of interest. As is often the case, the truth is more complicated, and progress is not simple!